Sampling Techniques: Resilience in young adults with relation to received and perceived Social support¶

Name: Krish Agarwal
Reg No: 21112016
Class: 4BSc DS A

CIA3 Comp2¶

About the Psycological Evaluation¶

I conducted a psychological evaluation with students of CHRIST (Deemed to be University) Pune, Lavasa Campus as the target population for the same. The evaluation aimed at calculating the Resilience and Social Support Score for individuals.
Link to the form: https://forms.gle/XKhTci8azE9GVMkT9

Social Support: A network of family, friends, neighbors, and community members that is available in times of need to give psychological, physical, and financial help.
Resilience: The ability to cope with and recover from setbacks.

After the scores were calculated, suitable type of analysis was done on the data to predict the Resilience Score from Social Support variables. The project primarily focuses on sampling techniques and prediction/estimmation. The Questionnaire selected for this evaluation has already been used in publications and have a valid and verified scoring technique.

The links to the questionnaire and their respective scoring techniques are listed below:

  1. Social Support: https://elcentro.sonhs.miami.edu/research/measures-library/mspss/index.html#:~:text=The%20Multidimensional%20Scale%20of%20Perceived,%2C%205%20%3D%20strongly%20agree
  2. Resilience: https://www.nwpgmd.nhs.uk/sites/default/files/resiliencequestionnaire.pdf

Problem Definition:¶

Collect data from students and calculate their Resilience and Social Support Score. The data collected is considered to be noisy and hence, imply various functions offered by libraries to clean the same and gain meaningful insightfuls.

Approach/Method:¶

  1. Collecting survey throughout the campus to gather data.
  2. Importing and structuring the data onto pandas dataframe.
  3. Renaming the columns for easy computation.
  4. Creating user-defined functions for data conversion to accepted scale.
  5. Extensive Data Visualization for useful insights.
  6. Performing the suitable sampling technique to derive a sample from the population.
  7. Performing Hypothesis/Prediction on the data.

Observation:¶

  1. Majority of the population belong to Gujarat which is followed up people from Maharashtra.
  2. We can see that there is a majority of Males followed by Females and lastly Others.
  3. Majority of the students fall under the 'Moderate' category under Social Support followed by ‘High’ and then ‘Low’.
  4. Majority of the students belong in the 'Developing' under the Resilience Status followed by 'Established', 'Strong' and then 'Exceptional'.
  5. There is a slight linear relation between 'SS Score' and 'R Score'. 6.Under the SS Status, 'Low', 'Moderate' and 'High' are all majorly comprised of Females.
  6. Under R Status, Males majorly comprise of every status except 'Established' which is majorly comprised by Females.
  7. Students from ‘West Bengal’ have high Social Support score.
  8. Students from 'West Bengal' and 'Assam' have high Resilience Support Score.
  9. It is observed that the highest R Score belongs to a Male from Rajasthan whereas the lowest R Score belongs to a 1 Male from Tamil Nadu and 2 Female from Maharashtra and Gujarat.
  10. It is observed that the highest SS Score belongs to a Female from Gujarat whereas the lowest SS Score belongs to a Male from Nagaland.

Results:¶

  1. The highest R Score belongs to a Male from Gujarat whereas the lowest R Score belongs to a Female from Telangana.
  2. The highest SS Score belongs to a Male from Gujarat whereas the lowest SS Score also belongs to a Male from Gujarat.

References:¶

  1. Stackoverflow
  2. Geeks for Geeks
  3. Youtube
  4. Javapoint
  5. W3school
  6. Medium

Code:¶

In [51]:
# Importing all the necessary libraries/modules 
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
from krishKiLibrary import countUnique
import plotly.express as px
from sklearn.preprocessing import LabelEncoder
from sklearn.linear_model import LinearRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.metrics import r2_score
from sklearn.metrics import mean_squared_error
from sklearn.metrics import mean_absolute_error

Importing data onto python¶

In [2]:
df = pd.read_csv("D:/Z/CU/SEM3/Data Analytics/CIA/EndSem/Datasets/21112016_KrishAgarwal_CleanedDataset.csv")
df = df.drop([df.columns[0]], axis = 1) # Dropping unnecessary columns
In [3]:
df
Out[3]:
Age Gender State SS1 SS2 SS3 SS4 SS5 SS6 SS7 ... R7 R8 R9 R10 R11 R12 SS Score SS Status R Score R Status
0 19 Female Telangana 5 3 6 5 4 6 5 ... 1 1 1 3 1 1 4.916667 Moderate 18 Developing
1 19 Female Jharkhand 4 6 6 6 6 6 6 ... 2 3 3 3 3 3 5.833333 High 38 Established
2 19 Male Chhattisgarh 4 5 7 6 4 5 5 ... 4 3 5 5 4 4 5.250000 High 52 Exceptional
3 19 Male Maharashtra 5 6 6 6 6 6 6 ... 1 2 3 4 4 3 5.250000 High 35 Developing
4 19 Male Andhra Pradesh 6 4 6 4 4 4 4 ... 3 3 3 2 3 4 4.333333 Moderate 36 Developing
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
103 21 Male West Bengal 5 5 6 5 1 1 1 ... 1 2 2 1 2 2 2.750000 Low 23 Developing
104 20 Male Assam 7 4 6 4 6 6 6 ... 3 1 4 5 4 3 5.416667 High 50 Exceptional
105 19 Male Gujarat 7 7 7 7 7 7 7 ... 5 5 5 5 5 5 7.000000 High 60 Exceptional
106 16 Male Gujarat 1 1 1 1 1 1 5 ... 5 5 5 5 5 5 2.083333 Low 40 Established
107 17 Female Delhi 7 6 5 5 5 4 3 ... 1 2 4 4 2 4 4.916667 Moderate 44 Strong

108 rows × 31 columns

About the data¶

The dataframe consists of 108 entries in total which were collected from students pursuing their degrees at CHRIST (Deemed to be University) Pune, Lavasa Campus. The data is spread across 20 different states.
The columns 'SS' and 'R' stand for questions from Social Support and Resilience questionnaires respectively. Both contains of 12 features each. The Final Scores and their categories have been computed from the scoring techniques mentioned in their questionnaires.

Performing Proportionate Stratified Sampling

  • What is Stratified Sampling?
    Stratified sampling is a random sampling method of dividing the population into various subgroups or strata and drawing a random sample from each. Each subgroup or stratum consists of items that have common characteristics. This sampling method is widely used in human research or political surveys.

  • Types of Stratified Sampling:
  1. Equal Allocation
  2. Proportionate Allocation
  3. Neyman Allocation
  4. Optimal Allocation


  • Why do we choose Proportionate Stratified Sampling here?
    Proportionate stratified sampling is used to ensure that each subgroup in a population is represented in the sample proportionally to its size in the population. In proportionate sampling, the sample size of each stratum is equal to the subgroup’s proportion in the population as a whole.

  • Advantages of Proportionate Stratified Sampling:
  1. Increased representativenes
  2. Reduced sampling error
  3. Increased Precision
  4. Administrative Convenience


  • Disadvantages of Proportionate Stratified Sampling:
  1. Increased Complexity
  2. Potential Bias
  3. Reduced Variability
  4. Increased Cost

Stratified-Sampling.jpg

Population Size: 108
Sample Size (Pre-Defined): 30
Elements from Each Stratum: $n_i = n\frac{N_i}{N}$

Creating Stratas on the basis of 'R Status'¶

In [4]:
stratas = countUnique(df, df['R Status'].unique(), 'R Status')
print('Stratas:', stratas)

'''
n1 --> Developing
n2 --> Established
n3 --> Exceptional
n4 --> Strong
'''
Stratas: {'Developing': 39, 'Established': 35, 'Exceptional': 12, 'Strong': 22}
Out[4]:
'\nn1 --> Developing\nn2 --> Established\nn3 --> Exceptional\nn4 --> Strong\n'

Creating a User-Defined Function for Sampling¶

In [5]:
def propStratifiedSampling(df, column, sample_size):
    import pandas as pd
    
    # Count of elements in each category
    stratas_ = {}
    for i in range(len(df[column].unique())):
        k = 0
        for j in range(len(df[column])):
            if df[column][j] == df[column].unique()[i]:
                k += 1
        stratas_[df[column].unique()[i]] = k
    
    # Defining pop_size and additional dataframe
    population_size = len(df)
    sample_df = pd.DataFrame()
    
    # Calculating number of elements from each stratum to be taken into sample
    for i in range(len(stratas_)):
        n_i = round(sample_size*(list(stratas_.values())[i]/population_size))
        
        # adding n_i no.of random elements from stratums into sample dataframe 
        df_ = df[df[column] == list(stratas_.keys())[i]].sample(frac = 1)[0:n_i]
        sample_df = pd.concat([sample_df, df_])
    
    # Shuffling the dataframe
    sample_df = sample_df.sample(frac = 1)
    # Resetting the Index
    sample_df.reset_index(inplace = True, drop = True)    
    return sample_df
In [6]:
sample = propStratifiedSampling(df, 'R Status', 30)
sample
Out[6]:
Age Gender State SS1 SS2 SS3 SS4 SS5 SS6 SS7 ... R7 R8 R9 R10 R11 R12 SS Score SS Status R Score R Status
0 18 Male Tamil Nadu 5 3 5 4 3 4 3 ... 1 5 3 4 2 1 3.833333 Moderate 32 Developing
1 19 Male Maharashtra 5 6 6 6 6 6 6 ... 1 2 3 4 4 3 5.250000 High 35 Developing
2 19 Male Tamil Nadu 4 4 4 4 4 4 4 ... 3 3 4 4 4 3 4.000000 Moderate 46 Strong
3 19 Female Rajasthan 7 6 6 7 7 5 5 ... 1 4 4 5 3 3 6.083333 High 35 Developing
4 19 Female Uttar Pradesh 5 5 5 5 5 5 5 ... 4 1 3 4 3 3 5.000000 Moderate 36 Developing
5 19 Others Nagaland 1 2 2 4 1 4 1 ... 2 2 3 3 2 5 1.666667 Low 35 Developing
6 19 Female Maharashtra 4 5 5 3 4 6 6 ... 2 3 3 4 4 3 4.750000 Moderate 39 Established
7 20 Male Gujarat 5 6 6 6 6 6 2 ... 2 4 2 4 2 4 4.583333 Moderate 38 Established
8 20 Male Bihar 3 2 6 6 2 2 2 ... 3 3 3 3 3 3 3.416667 Moderate 36 Developing
9 20 Male Bihar 4 2 5 5 5 1 5 ... 2 4 5 3 3 4 4.666667 Moderate 35 Developing
10 19 Male West Bengal 5 6 6 6 6 6 6 ... 4 4 4 4 4 4 5.916667 High 46 Strong
11 18 Male Delhi 7 6 6 6 6 6 6 ... 2 2 4 4 2 3 5.583333 High 41 Established
12 23 Male Maharashtra 7 7 7 6 7 6 6 ... 5 1 3 4 5 5 6.500000 High 49 Exceptional
13 23 Female Uttar Pradesh 4 4 4 6 4 6 2 ... 1 5 4 3 5 3 4.500000 Moderate 36 Developing
14 18 Male Tamil Nadu 2 2 6 5 2 2 2 ... 2 2 4 2 4 4 3.083333 Moderate 42 Established
15 19 Female Puducherry 4 7 2 4 7 4 3 ... 3 4 3 4 4 4 4.083333 Moderate 44 Strong
16 19 Male Rajasthan 4 4 4 3 3 6 7 ... 5 1 5 5 5 5 4.500000 Moderate 56 Exceptional
17 19 Male Telangana 1 1 5 5 1 6 6 ... 5 3 4 3 4 1 3.666667 Moderate 39 Established
18 16 Female Gujarat 3 5 4 4 5 5 3 ... 2 2 2 4 3 3 4.000000 Moderate 33 Developing
19 22 Male Maharashtra 3 3 6 5 7 7 6 ... 4 3 4 4 4 5 5.250000 High 49 Exceptional
20 16 Male Gujarat 6 5 6 3 6 6 7 ... 4 3 5 4 4 3 5.500000 High 43 Established
21 19 Male Maharashtra 4 5 5 5 5 5 1 ... 3 3 3 4 4 4 4.416667 Moderate 44 Strong
22 16 Male Gujarat 1 1 1 1 1 1 5 ... 5 5 5 5 5 5 2.083333 Low 40 Established
23 16 Female Maharashtra 5 5 6 5 6 6 6 ... 3 2 3 3 2 3 5.666667 High 32 Developing
24 18 Female Tamil Nadu 6 4 4 5 6 5 6 ... 4 5 3 2 5 4 5.083333 High 43 Established
25 18 Male Gujarat 5 7 3 1 6 7 7 ... 2 4 4 1 1 3 5.416667 High 36 Developing
26 49 Female Gujarat 3 7 6 6 7 6 6 ... 4 2 4 5 4 4 5.833333 High 42 Established
27 21 Female Gujarat 6 7 7 7 7 6 6 ... 3 5 3 4 5 4 6.500000 High 46 Strong
28 18 Male Telangana 5 7 5 5 7 7 7 ... 4 2 3 4 4 4 6.250000 High 45 Strong
29 19 Female Gujarat 5 5 2 2 5 6 5 ... 2 3 4 5 2 2 4.166667 Moderate 43 Established

30 rows × 31 columns

EDA on the sample¶

In [7]:
# --- 1) Finding outliers via a boxplot in the 'Age' column ---
sns.set(rc={'figure.figsize':(10, 5)})
sns.boxplot(x = sample["Age"]).set(title='Age Distribution')
Out[7]:
[Text(0.5, 1.0, 'Age Distribution')]

Inference: We find that there are outliers which do not fall in our area of research. So, we deal with each of them accordingly and drop the values which are supposed to be dropped.

In [8]:
# --- 2) Count of elemets in each column ---
col_list = df.drop(['Gender', 'State', 'SS Status', 'R Status'], axis = 1)

plt.figure(figsize=(20,20))
column_list = col_list[2:]
plt_num = 1

for i in column_list:
    if plt_num<=18:
        plt.subplot(6, 6, plt_num)
        sns.histplot(df[i])
        plt_num = plt_num+1
    else:
        plt.subplot(6, 6, plt_num)
        sns.histplot(df[i])
        plt_num = plt_num+1
plt.tight_layout()

Inference: From the above graph, we get a rough idea about the count in each column.¶

In [9]:
# --- 3) Percentage of people in each State ---
state_unique = list(sample['State'].unique())
state_data = []
for i in range(len(state_unique)):
  state_data.append(len(sample[sample['State'] == state_unique[i]]))

# Wedge properties
wp = { 'linewidth' : 1, 'edgecolor' : "green" }
 
# Creating autocpt arguments
def func(pct, allvalues):
    absolute = int(pct / 100.*np.sum(allvalues))
    return "{:.1f}%\n({:d} g)".format(pct, absolute)
 
# Creating plot
fig, ax = plt.subplots(figsize =(30, 15))
wedges, texts, autotexts = ax.pie(state_data,
                                  autopct = lambda pct: func(pct, state_data),
                                  labels = state_unique,
                                  shadow = True,
                                  startangle = 90,
                                  wedgeprops = wp,
                                  textprops = dict(color ="black"))
 
# Adding legend
ax.legend(wedges, state_data,
          title = "States",
          loc ="center left",
          bbox_to_anchor =(1, 0, 0.5, 1))
 
plt.setp(autotexts, size = 8, weight = "bold")
ax.set_title("Percentage of students in each State")
 
# show plot
plt.show()

Inference: We can see that majority of the people who filled the form belong to Gujarat which is followed up people from Maharashtra.

In [10]:
# --- 4) Number of people belonging to each gender ---
plt.hist(sample['Gender'])
plt.title('Gender Count')
Out[10]:
Text(0.5, 1.0, 'Gender Count')

Inference: We can see that there is a majority of Males followed by Females and lastly others.

In [11]:
# --- 5) Percentage of People belonging in Each Social Support Category ---
ss_status_unique = list(sample['SS Status'].unique())
ss_status_data = []
for i in range(len(ss_status_unique)):
  ss_status_data.append(len(sample[sample['SS Status'] == ss_status_unique[i]]))

# Creating plot
fig = plt.figure(figsize =(10, 7))
plt.pie(ss_status_data, labels = ss_status_unique, autopct='%1.1f%%')
 
# show plot
plt.title('Distribution of Students in Social Support Categories')
plt.show()

Inference: Majority of the students fall under the 'Moderate' category in their Social Support domain.

In [12]:
# --- 6) Percentage of People belonging in Each Resilience Category ---
r_status_unique = list(sample['R Status'].unique())
r_status_data = []
for i in range(len(r_status_unique)):
  r_status_data.append(len(sample[sample['R Status'] == r_status_unique[i]]))

# Creating plot
fig = plt.figure(figsize =(10, 7))
plt.pie(r_status_data, labels = r_status_unique, autopct='%1.1f%%')
 
# show plot
plt.show()

Inference: We infer from the Graph that majority of them belong in the 'Developing' domain.

In [13]:
# --- 7) Correlation Heatmap ---
sns.set(rc={'figure.figsize':(25, 10)})
corr_map = sns.heatmap(sample.corr().round(2), annot=True)
In [14]:
# --- 8) SS Score vs R Score ---
sns.regplot(data = df, x = 'SS Score', y = 'R Score').set(title = 'SS Score vs R Score')
Out[14]:
[Text(0.5, 1.0, 'SS Score vs R Score')]

Inference: It is observed that there is a slight linear relation between 'SS Score' and 'R Score'. (The same was hinted by the correlation heatmap)

In [17]:
# --- 9) SS Status vs Gender ---
gender_unique = list(sample['Gender'].unique())
gender_ss_label = ['Female_Mod', "Male_Mod", "Others_Mod", "Female_High", "Male_High", "Others_High", "Female_Low", "Male_Low", "Others_Low"]

gender_ss_data = []
for i in range(len(ss_status_unique)):
  for j in range(len(gender_unique)):
    gender_ss_data.append(len(sample[(sample['SS Status'] == ss_status_unique[i]) & (sample['Gender']  == gender_unique[j])]))
    
# Plot
sns.swarmplot(x = gender_ss_data, y = gender_ss_label, palette = "deep")
Out[17]:
<AxesSubplot:>

Inference:
Low Status count: Male > Female = Others
Moderate Status count: Female > Male > Others
High Status count: Female > Male > Others

In [19]:
# --- 10) R Status vs Gender ---
gender_r_label = ['Female_Dev', "Male_Dev", "Others_Dev", "Female_Est", "Male_Est", "Others_Est", "Female_Exc", "Male_Exc", "Others_Exc", "Female_Str", "Male_Str", "Others_Str"]

gender_r_data = []
for i in range(len(r_status_unique)):
  for j in range(len(gender_unique)):
    gender_r_data.append(len(sample[(sample['R Status'] == r_status_unique[i]) & (df['Gender']  == gender_unique[j])]))
    
sns.swarmplot(x = gender_r_data, y = gender_r_label, palette = "deep").set(title = 'R Status vs Gender')
C:\Users\KRISH\AppData\Local\Temp\ipykernel_27516\177005251.py:7: UserWarning: Boolean Series key will be reindexed to match DataFrame index.
  gender_r_data.append(len(sample[(sample['R Status'] == r_status_unique[i]) & (df['Gender']  == gender_unique[j])]))
C:\Users\KRISH\AppData\Local\Temp\ipykernel_27516\177005251.py:7: UserWarning: Boolean Series key will be reindexed to match DataFrame index.
  gender_r_data.append(len(sample[(sample['R Status'] == r_status_unique[i]) & (df['Gender']  == gender_unique[j])]))
C:\Users\KRISH\AppData\Local\Temp\ipykernel_27516\177005251.py:7: UserWarning: Boolean Series key will be reindexed to match DataFrame index.
  gender_r_data.append(len(sample[(sample['R Status'] == r_status_unique[i]) & (df['Gender']  == gender_unique[j])]))
C:\Users\KRISH\AppData\Local\Temp\ipykernel_27516\177005251.py:7: UserWarning: Boolean Series key will be reindexed to match DataFrame index.
  gender_r_data.append(len(sample[(sample['R Status'] == r_status_unique[i]) & (df['Gender']  == gender_unique[j])]))
C:\Users\KRISH\AppData\Local\Temp\ipykernel_27516\177005251.py:7: UserWarning: Boolean Series key will be reindexed to match DataFrame index.
  gender_r_data.append(len(sample[(sample['R Status'] == r_status_unique[i]) & (df['Gender']  == gender_unique[j])]))
C:\Users\KRISH\AppData\Local\Temp\ipykernel_27516\177005251.py:7: UserWarning: Boolean Series key will be reindexed to match DataFrame index.
  gender_r_data.append(len(sample[(sample['R Status'] == r_status_unique[i]) & (df['Gender']  == gender_unique[j])]))
C:\Users\KRISH\AppData\Local\Temp\ipykernel_27516\177005251.py:7: UserWarning: Boolean Series key will be reindexed to match DataFrame index.
  gender_r_data.append(len(sample[(sample['R Status'] == r_status_unique[i]) & (df['Gender']  == gender_unique[j])]))
C:\Users\KRISH\AppData\Local\Temp\ipykernel_27516\177005251.py:7: UserWarning: Boolean Series key will be reindexed to match DataFrame index.
  gender_r_data.append(len(sample[(sample['R Status'] == r_status_unique[i]) & (df['Gender']  == gender_unique[j])]))
C:\Users\KRISH\AppData\Local\Temp\ipykernel_27516\177005251.py:7: UserWarning: Boolean Series key will be reindexed to match DataFrame index.
  gender_r_data.append(len(sample[(sample['R Status'] == r_status_unique[i]) & (df['Gender']  == gender_unique[j])]))
C:\Users\KRISH\AppData\Local\Temp\ipykernel_27516\177005251.py:7: UserWarning: Boolean Series key will be reindexed to match DataFrame index.
  gender_r_data.append(len(sample[(sample['R Status'] == r_status_unique[i]) & (df['Gender']  == gender_unique[j])]))
C:\Users\KRISH\AppData\Local\Temp\ipykernel_27516\177005251.py:7: UserWarning: Boolean Series key will be reindexed to match DataFrame index.
  gender_r_data.append(len(sample[(sample['R Status'] == r_status_unique[i]) & (df['Gender']  == gender_unique[j])]))
C:\Users\KRISH\AppData\Local\Temp\ipykernel_27516\177005251.py:7: UserWarning: Boolean Series key will be reindexed to match DataFrame index.
  gender_r_data.append(len(sample[(sample['R Status'] == r_status_unique[i]) & (df['Gender']  == gender_unique[j])]))
Out[19]:
[Text(0.5, 1.0, 'R Status vs Gender')]

Inference:
Developing Status count: Male > Female > Others
Established Status count: Female > Male > Others
Exceptional Status count: Male > Female > Others
Strong Status count: Male > Female > Others

In [20]:
# --- 11) SS Score vs State ---
sns.set(rc={'figure.figsize':(40, 10)})
sns.set_context("paper", font_scale=2)                                                  
bar1 = sns.barplot(data = sample, x = "State", y = "SS Score", ci = None)

for item in bar1.get_xticklabels():
    item.set_rotation(45)
bar1.set(title = 'SS Status vs State')
Out[20]:
[Text(0.5, 1.0, 'SS Status vs State')]

Inference: It is observed that people from 'West Bengal' have high Social Support Score.

In [21]:
# --- 12) R Score vs State ---
sns.set(rc={'figure.figsize':(40, 10)})
sns.set_context("paper", font_scale=2)                                                  
bar2 = sns.barplot(data = sample, x = "State", y = "R Score", ci = None)

for item in bar2.get_xticklabels():
    item.set_rotation(45)
bar2.set(title = "R Score vs State")
Out[21]:
[Text(0.5, 1.0, 'R Score vs State')]

Inference: It is observed that people from 'West Bengal' have high Resilience Score.

In [22]:
# --- 13) R Score vs State vs Gender ---
bar3 = sns.catplot(data = sample, x = "State", y = "R Score", hue = 'Gender' , height = 9, aspect = 4, ci = None)
bar3.set_xticklabels(rotation=30)
bar3.set(title = 'R Score vs State vs Gender')
Out[22]:
<seaborn.axisgrid.FacetGrid at 0x24b9e9af6a0>

Inference: It is observed that the R Score belongs to a Male from Rajasthan whereas the lowest R Score belongs to a 1 Male from Tamil Nadu and 2 Female from Maharashtra and Gujarat.

In [23]:
# --- 14) SS Score vs State vs Gender ---
bar4 = sns.catplot(data = sample, x = "State", y = "SS Score", hue = 'Gender' , height = 9, aspect = 4, ci = None)
bar4.set_xticklabels(rotation=30)
bar4.set(title = 'SS Score vs State vs Gender')
Out[23]:
<seaborn.axisgrid.FacetGrid at 0x24b9ee00fa0>

Inference: It is observed that the highest SS Score belongs to a Female from Gujarat whereas the lowest SS Score belongs to a Male from Nagaland.

In [24]:
# --- 17) R Score on the Indian map ---
fig = px.choropleth(
    sample,
    geojson = "https://gist.githubusercontent.com/jbrobst/56c13bbbf9d97d187fea01ca62ea5112/raw/e388c4cae20aa53cb5090210a42ebb9b765c0a36/india_states.geojson",
    featureidkey = 'properties.ST_NM',
    locations = 'State',
    color = 'R Score',
    color_continuous_scale='Reds',
    #mapbox_style="carto-positron",
)

fig.update_geos(fitbounds="locations", visible=False)

fig.show()

Inference: The above graph shows the count of R Score in each State.

In [25]:
# --- 18) SS Score on the Indian map ---
fig = px.choropleth(
    sample,
    geojson = "https://gist.githubusercontent.com/jbrobst/56c13bbbf9d97d187fea01ca62ea5112/raw/e388c4cae20aa53cb5090210a42ebb9b765c0a36/india_states.geojson",
    featureidkey = 'properties.ST_NM',
    locations = 'State',
    color = 'SS Score',
    color_continuous_scale='Reds',
    # mapbox_style="carto-positron",
)

fig.update_geos(fitbounds="locations", visible=False)

fig.show()

Inference: The above graph shows the count of SS Score in each State.

Predicting R Score from SS Variables¶

Predicting Resilience Score from Social Support variables (SS1, SS2...SS12, SS Score).

In [26]:
df.head()
Out[26]:
Age Gender State SS1 SS2 SS3 SS4 SS5 SS6 SS7 ... R7 R8 R9 R10 R11 R12 SS Score SS Status R Score R Status
0 19 Female Telangana 5 3 6 5 4 6 5 ... 1 1 1 3 1 1 4.916667 Moderate 18 Developing
1 19 Female Jharkhand 4 6 6 6 6 6 6 ... 2 3 3 3 3 3 5.833333 High 38 Established
2 19 Male Chhattisgarh 4 5 7 6 4 5 5 ... 4 3 5 5 4 4 5.250000 High 52 Exceptional
3 19 Male Maharashtra 5 6 6 6 6 6 6 ... 1 2 3 4 4 3 5.250000 High 35 Developing
4 19 Male Andhra Pradesh 6 4 6 4 4 4 4 ... 3 3 3 2 3 4 4.333333 Moderate 36 Developing

5 rows × 31 columns

In [28]:
# Pre-processing Gender columnn through LabelEncoder
gender_le = LabelEncoder()
df['Gender'] = gender_le.fit_transform(df['Gender'])

# 0 --> Female
# 1 --> Male
In [32]:
# Pre-processing Gender columnn through LabelEncoder
state_le = LabelEncoder()
df.State = state_le.fit_transform(df.State)
In [37]:
# Defining Dependable and Independable variables
X = df.drop(['R1', 'R2', 'R3', 'R4', 'R5', 'R6', 'R7', 'R8', 'R9', 'R10', 'R11', 'R12','SS Status', 'R Status'], axis = 1)
y = df['R Score']
In [42]:
# Splitting the data into training data and testing data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.3, random_state = 5)
linear_regressor = LinearRegression()

# Fitting the Data
linear_regressor.fit(X_train, y_train)

# Predicting the Target Variable
y_pred = linear_regressor.predict(X_test)
In [45]:
# Plotting the true and predicted value 
fig = plt.figure(figsize =(20, 10))
sns.regplot(y_test, y_pred)
plt.xlabel("R Score Test")
plt.ylabel("R Score Predicted")
plt.title("True Value vs Predicted Value")
plt.show()
C:\Users\KRISH\anaconda3\lib\site-packages\seaborn\_decorators.py:36: FutureWarning:

Pass the following variables as keyword args: x, y. From version 0.12, the only valid positional argument will be `data`, and passing other arguments without an explicit keyword will result in an error or misinterpretation.

In [57]:
# Evaluation Metrics 
print("The accuracy of the model is", linear_regressor.score(X_test, y_test))
print("The r2 score of the model is", r2_score(y_test, y_pred))
print("The Absolute Mean Error of the model is", mean_absolute_error(y_test, y_pred))
print("The Mean Squared Error of the model is", mean_squared_error(y_test, y_pred))
The accuracy of the model is 1.0
The r2 score of the model is 1.0
The Absolute Mean Error of the model is 7.536059318667729e-15
The Mean Squared Error of the model is 8.41451632235746e-29
In [58]:
print("The intercept of the model is", linear_regressor.intercept_, "\n\nThe weights of the features are as follows:\n", linear_regressor.coef_)
The intercept of the model is -3.552713678800501e-14 

The weights of the features are as follows:
 [-6.18547845e-18  1.14383311e-15 -1.85103699e-16 -1.72212377e-15
 -2.95503785e-16 -3.04147391e-16 -2.50319851e-16 -4.97898256e-16
  2.58678808e-16 -5.11337749e-17  3.22046882e-16 -1.33074664e-16
  9.20689422e-16 -1.92055263e-16 -1.00830170e-16 -1.61878490e-16
  1.00000000e+00]
In [59]:
# Equation of the model

def predictRScore(regressor, x):
  intercept = regressor.intercept_
  weights = regressor.coef_
  equation = 0

  for i in range(len(weights)):
      equation += (weights[i]*x[i])
  equation += intercept
  print("The R Score is " + str(equation))